pp <- read_csv("https://raw.githubusercontent.com/COGS137/datasets/main/paris_paintings.csv",
na = c("n/a", "", "NA"))07-linear-models
Linear Models
Q&A
Q: Do you recommend review some statistics before attending lectures?
A: This is very much up to you. It certainly wouldn’t hurt! There will be readings connected to each lecture; these are a great place to start. It can be helpful to work through these readings and the exercises either before or after lecture!
Q: I am still not quite sure how to do HW part 2 and 3, do we download data into the folder, read the file, and plot it?
A: Yup! That’s one approach. The other approach (if the data are not available for download) would be to create the dataset on your own, estimating the values as best you can from the plot and creating a dataframe/tibble containing that information directly in your code (similar to what you did in HW01)….and then plot from there.
Course Announcements
Due Dates:
- Lab 04 due tomorrow (2/3; 11:59 PM)
- Lecture Participation survey “due” after class
- HW02 due Monday (2/6; 11:59 PM)
- Midterm Exam
- practice exam(s) will be released tomorrow; answers posted next week
- will cover material through “Effective Communication”
- will be released/posted next Friday after lab
- will be due Monday Feb 13th at 11:59 PM
- will be an Rmd document and submitted via GitHub (like everything so far)
- will be commpleted individually (open Notes; open Google)
Agenda
- Linear Models
- Quantitative Predictor
- Categorical Predictor (2 & >2 levels)
- residuals
- data transformations
tidymodels
Data: Paris Paintings
- Number of observations: 3393
- Number of variables: 61
Goal: Predict height from width
\[\widehat{height}_{i} = \beta_0 + \beta_1 \times width_{i}\]
tidymodels
- NOT a core
tidyversepackage - follows the structure of a
tidyversepackage
# should already be installed for you on datahub
library(tidymodels)Step 1: Specify model
linear_reg()Linear Regression Model Specification (regression)
Computational engine: lm
Step 2: Set model fitting engine
linear_reg() |>
set_engine("lm") # lm: linear modelLinear Regression Model Specification (regression)
Computational engine: lm
Step 3: Fit model & estimate parameters
… using formula syntax
linear_reg() |>
set_engine("lm") |>
fit(Height_in ~ Width_in, data = pp)parsnip model object
Call:
stats::lm(formula = Height_in ~ Width_in, data = data)
Coefficients:
(Intercept) Width_in
3.6214 0.7808
A closer look at model output
parsnip model object
Call:
stats::lm(formula = Height_in ~ Width_in, data = data)
Coefficients:
(Intercept) Width_in
3.6214 0.7808
\[\widehat{height}_{i} = 3.6214 + 0.7808 \times width_{i}\]
A tidy look at model output
linear_reg() |>
set_engine("lm") |>
fit(Height_in ~ Width_in, data = pp) |>
tidy()# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 3.62 0.254 14.3 8.82e-45
2 Width_in 0.781 0.00950 82.1 0
\[\widehat{height}_{i} = 3.62 + 0.781 \times width_{i}\]
Slope and intercept
\[\widehat{height}_{i} = 3.62 + 0.781 \times width_{i}\]
- Slope: For each additional inch the painting is wider, the height is expected to be higher, on average, by 0.781 inches.
- Intercept: Paintings that are 0 inches wide are expected to be 3.62 inches high, on average. (Does this make sense?)
Correlation does not imply causation
Remember this when interpreting model coefficients
Source: XKCD, Cell phones
Parameter Estimation
Linear model with a single predictor
- We’re interested in \(\beta_0\) (population parameter for the intercept) and \(\beta_1\) (population parameter for the slope) in the following model:
\[\hat{y}_{i} = \beta_0 + \beta_1~x_{i}\]
- Tough luck, you can’t have them…
- So we use sample statistics to estimate them:
\[\hat{y}_{i} = b_0 + b_1~x_{i}\]
Least squares regression
- The regression line minimizes the sum of squared residuals.
- If \(e_i = y_i - \hat{y}_i\), then, the regression line minimizes \(\sum_{i = 1}^n e_i^2\).
Visualizing residuals
Visualizing residuals (cont.)
Visualizing residuals (cont.)
Properties of least squares regression
- The regression line goes through the center of mass point, the coordinates corresponding to average \(x\) and average \(y\), \((\bar{x}, \bar{y})\):
\[\bar{y} = b_0 + b_1 \bar{x} ~ \rightarrow ~ b_0 = \bar{y} - b_1 \bar{x}\]
- The slope has the same sign as the correlation coefficient: \(b_1 = r \frac{s_y}{s_x}\)
- The sum of the residuals is zero: \(\sum_{i = 1}^n e_i = 0\)
- The residuals and \(x\) values are uncorrelated
Models with categorical explanatory variables
Categorical predictor with 2 levels
# A tibble: 3,393 × 3
name Height_in landsALL
<chr> <dbl> <dbl>
1 L1764-2 37 0
2 L1764-3 18 0
3 L1764-4 13 1
4 L1764-5a 14 1
5 L1764-5b 14 1
6 L1764-6 7 0
7 L1764-7a 6 0
8 L1764-7b 6 0
9 L1764-8 15 0
10 L1764-9a 9 0
11 L1764-9b 9 0
12 L1764-10a 16 1
13 L1764-10b 16 1
14 L1764-10c 16 1
15 L1764-11 20 0
16 L1764-12a 14 1
17 L1764-12b 14 1
18 L1764-13a 15 1
19 L1764-13b 15 1
20 L1764-14 37 0
# … with 3,373 more rows
landsALL = 0: No landscape featureslandsALL = 1: Some landscape features
Height & landscape features
m_ht_lands <- linear_reg() |>
set_engine("lm") |>
fit(Height_in ~ factor(landsALL), data = pp)
m_ht_lands |> tidy()# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 22.7 0.328 69.1 0
2 factor(landsALL)1 -5.65 0.532 -10.6 7.97e-26
Height & landscape features
\[\widehat{Height_{in}} = 22.7 - 5.645~landsALL\]
- Slope: Paintings with landscape features are expected, on average, to be 5.645 inches shorter than paintings that without landscape features
- Compares baseline level (
landsALL = 0) to the other level (landsALL = 1)
- Compares baseline level (
- Intercept: Paintings that don’t have landscape features are expected, on average, to be 22.7 inches tall
Categorical predictor with >2 levels
# A tibble: 3,393 × 3
name Height_in school_pntg
<chr> <dbl> <chr>
1 L1764-2 37 F
2 L1764-3 18 I
3 L1764-4 13 D/FL
4 L1764-5a 14 F
5 L1764-5b 14 F
6 L1764-6 7 I
7 L1764-7a 6 F
8 L1764-7b 6 F
9 L1764-8 15 I
10 L1764-9a 9 D/FL
11 L1764-9b 9 D/FL
12 L1764-10a 16 X
13 L1764-10b 16 X
14 L1764-10c 16 X
15 L1764-11 20 D/FL
16 L1764-12a 14 D/FL
17 L1764-12b 14 D/FL
18 L1764-13a 15 D/FL
19 L1764-13b 15 D/FL
20 L1764-14 37 F
# … with 3,373 more rows
- school from which painting came (details in a few slides)
Relationship between height and school
linear_reg() |>
set_engine("lm") |>
fit(Height_in ~ school_pntg, data = pp) |>
tidy()# A tibble: 7 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 14.0 10.0 1.40 0.162
2 school_pntgD/FL 2.33 10.0 0.232 0.816
3 school_pntgF 10.2 10.0 1.02 0.309
4 school_pntgG 1.65 11.9 0.139 0.889
5 school_pntgI 10.3 10.0 1.02 0.306
6 school_pntgS 30.4 11.4 2.68 0.00744
7 school_pntgX 2.87 10.3 0.279 0.780
Dummy variables
# A tibble: 7 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 14.0 10.0 1.40 0.162
2 school_pntgD/FL 2.33 10.0 0.232 0.816
3 school_pntgF 10.2 10.0 1.02 0.309
4 school_pntgG 1.65 11.9 0.139 0.889
5 school_pntgI 10.3 10.0 1.02 0.306
6 school_pntgS 30.4 11.4 2.68 0.00744
7 school_pntgX 2.87 10.3 0.279 0.780
- When the categorical explanatory variable has many levels, they’re encoded to dummy variables
- Each coefficient describes the expected difference between heights in that particular school compared to the baseline level
Categorical predictor with 3+ levels
| school_pntg | D_FL | F | G | I | S | X |
|---|---|---|---|---|---|---|
| A | 0 | 0 | 0 | 0 | 0 | 0 |
| D/FL | 1 | 0 | 0 | 0 | 0 | 0 |
| F | 0 | 1 | 0 | 0 | 0 | 0 |
| G | 0 | 0 | 1 | 0 | 0 | 0 |
| I | 0 | 0 | 0 | 1 | 0 | 0 |
| S | 0 | 0 | 0 | 0 | 1 | 0 |
| X | 0 | 0 | 0 | 0 | 0 | 1 |
# A tibble: 3,393 × 3
name Height_in school_pntg
<chr> <dbl> <chr>
1 L1764-2 37 F
2 L1764-3 18 I
3 L1764-4 13 D/FL
4 L1764-5a 14 F
5 L1764-5b 14 F
6 L1764-6 7 I
7 L1764-7a 6 F
8 L1764-7b 6 F
9 L1764-8 15 I
10 L1764-9a 9 D/FL
11 L1764-9b 9 D/FL
12 L1764-10a 16 X
13 L1764-10b 16 X
14 L1764-10c 16 X
15 L1764-11 20 D/FL
16 L1764-12a 14 D/FL
17 L1764-12b 14 D/FL
18 L1764-13a 15 D/FL
19 L1764-13b 15 D/FL
20 L1764-14 37 F
# … with 3,373 more rows
The linear model with multiple predictors
- Population model:
\[ \hat{y} = \beta_0 + \beta_1~x_1 + \beta_2~x_2 + \cdots + \beta_k~x_k \]
- Sample model that we use to estimate the population model:
\[ \hat{y} = b_0 + b_1~x_1 + b_2~x_2 + \cdots + b_k~x_k \]
Relationship b/w height and school
# A tibble: 7 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 14.0 10.0 1.40 0.162
2 school_pntgD/FL 2.33 10.0 0.232 0.816
3 school_pntgF 10.2 10.0 1.02 0.309
4 school_pntgG 1.65 11.9 0.139 0.889
5 school_pntgI 10.3 10.0 1.02 0.306
6 school_pntgS 30.4 11.4 2.68 0.00744
7 school_pntgX 2.87 10.3 0.279 0.780
- Austrian school (A) paintings are expected, on average, to be 14 inches tall.
- Dutch/Flemish school (D/FL) paintings are expected, on average, to be 2.33 inches taller than Austrian school paintings.
- French school (F) paintings are expected, on average, to be 10.2 inches taller than Austrian school paintings.
- German school (G) paintings are expected, on average, to be 1.65 inches taller than Austrian school paintings.
- Italian school (I) paintings are expected, on average, to be 10.3 inches taller than Austrian school paintings.
- Spanish school (S) paintings are expected, on average, to be 30.4 inches taller than Austrian school paintings.
- Paintings whose school is unknown (X) are expected, on average, to be 2.87 inches taller than Austrian school paintings. ]
Prediction with models
Predict height from width
❓ On average, how tall are paintings that are 60 inches wide? \[\widehat{Height_{in}} = 3.62 + 0.78~Width_{in}\]
3.62 + 0.78 * 60[1] 50.42
“On average, we expect paintings that are 60 inches wide to be 50.42 inches high.”
Warning: We “expect” this to happen, but there will be some variability. (We’ll learn about measuring the variability around the prediction later.)
Prediction vs. extrapolation
❓ On average, how tall are paintings that are 400 inches wide? \[\widehat{Height_{in}} = 3.62 + 0.78~Width_{in}\]
Watch out for extrapolation!
“When those blizzards hit the East Coast this winter, it proved to my satisfaction that global warming was a fraud. That snow was freezing cold. But in an alarming trend, temperatures this spring have risen. Consider this: On February 6th it was 10 degrees. Today it hit almost 80. At this rate, by August it will be 220 degrees. So clearly folks the climate debate rages on.”1
Stephen Colbert, April 6th, 2010
Measuring model fit
Measuring the strength of the fit
The strength of the fit of a linear model is most commonly evaluated using \(R^2\).
It tells us what percent of variability in the response variable is explained by the model.
The remainder of the variability is explained by variables not included in the model.
\(R^2\) is sometimes called the coefficient of determination.
Obtaining \(R^2\) in R
- Height vs. width
glance(m_ht_wt)# A tibble: 1 × 12
r.squared adj.r.sq…¹ sigma stati…² p.value df logLik AIC BIC devia…³
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.683 0.683 8.30 6749. 0 1 -11083. 22173. 22191. 216055.
# … with 2 more variables: df.residual <int>, nobs <int>, and abbreviated
# variable names ¹adj.r.squared, ²statistic, ³deviance
glance(m_ht_wt)$r.squared # extract R-squared[1] 0.6829468
Roughly 68% of the variability in heights of paintings can be explained by their widths.
- Height vs. landscape features
glance(m_ht_lands)$r.squared[1] 0.03456724
Exploring linearity
Data: Paris Paintings
Price vs. width
❓ Describe the relationship between price and width of paintings whose width is less than 100in.
Price vs. width
❓ Which plot shows a more linear relationship?
Price vs. width, residuals
❓ Which plot shows a residuals that are uncorrelated with predicted values from the model?
❓What’s the unit of residuals?
Transforming the data
- We saw that
pricehas a right-skewed distribution, and the relationship between price and width of painting is non-linear.
- In these situations a transformation applied to the response variable may be useful.
- In order to decide which transformation to use, we should examine the distribution of the response variable.
- The extremely right skewed distribution suggests that a log transformation may be useful.
- log = natural log, \(ln\)
- Default base of the
logfunction in R is the natural log:
log(x, base = exp(1))
Logged price vs. width
❓ How do we interpret the slope of this model?
Interpreting models with log transformation
m_lprice_wt <- lm(log(price) ~ Width_in, data = pp_wt_lt_100)
m_lprice_wt |>
tidy() |>
select(term, estimate) |>
mutate(estimate = round(estimate, 3))# A tibble: 2 × 2
term estimate
<chr> <dbl>
1 (Intercept) 4.67
2 Width_in 0.019
Linear model with log transformation
\[ \widehat{log(price)} = 4.67 + 0.02 Width \]
- For each additional inch the painting is wider, the log price of the painting is expected to be higher, on average, by 0.02 livres.
- which is not a very useful statement…
Working with logs
- Subtraction and logs: \(log(a) − log(b) = log(a / b)\)
- Natural logarithm: \(e^{log(x)} = x\)
- We can use these identities to “undo” the log transformation
Interpreting models with log transformation
The slope coefficient for the log transformed model is 0.02, meaning the log price difference between paintings whose widths are one inch apart is predicted to be 0.02 log livres.
\[ log(\text{price for width x+1}) - log(\text{price for width x}) = 0.02 \]
\[ log\left(\frac{\text{price for width x+1}}{\text{price for width x}}\right) = 0.02 \]
\[ e^{log\left(\frac{\text{price for width x+1}}{\text{price for width x}}\right)} = e^{0.02} \]
\[ \frac{\text{price for width x+1}}{\text{price for width x}} \approx 1.02 \]
For each additional inch the painting is wider, the price of the painting is expected to be higher, on average, by a factor of 1.02.
Shortcuts in R
m_lprice_wt |>
tidy() |>
select(term, estimate) |>
mutate(estimate = round(estimate, 3))# A tibble: 2 × 2
term estimate
<chr> <dbl>
1 (Intercept) 4.67
2 Width_in 0.019
m_lprice_wt |>
tidy() |>
select(term, estimate) |>
mutate(estimate = round(exp(estimate), 3))# A tibble: 2 × 2
term estimate
<chr> <dbl>
1 (Intercept) 107.
2 Width_in 1.02
Recap: Log Transformations
- Non-constant variance is one of the most common model violations, however it is usually fixable by transforming the response (y) variable.
- The most common transformation when the response variable is right skewed is the log transform: \(log(y)\), especially useful when the response variable is (extremely) right skewed.
- This transformation is also useful for variance stabilization.
- When using a log transformation on the response variable the interpretation of the slope changes: “For each unit increase in x, y is expected on average to be higher/lower
by a factor of \(e^{b_1}\).”
- Another useful transformation is the square root: \(\sqrt{y}\), especially useful when the response variable is counts.
Aside: when \(y = 0\)
In some cases the value of the response variable might be 0, and
log(0)[1] -Inf
The trick is to add a very small number to the value of the response variable for these cases so that the log function can still be applied:
log(0 + 0.00001)[1] -11.51293
Recap
- Can I carry out linear regression using the
tidymodelsapproach? - Can I interpret and explain the results from a linear model with a single predictor?
- Do I understand the limitations of modelling data w/ linear regression?
- Can I describe and implement the use of a dummy variable in linear regression?
- Can I determine when logistic transformation may be appropriate? Can I interpret these results?
Suggested Reading
- R4DS Chapter 24: Model Building
- Introduction to Modern Statistics Chapter 7: Linear Regression with a Single Predictor
Footnotes
Introduction to Modern Statistics. “Extrapolation is treacherous.”↩︎